Machine Learning in Practice, Part 4: Allstate Claims Severity Prediction (xgboost)


2024-06-29 08:44 | Source: compiled from the web


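For this machine learning exercise I reused a project I did at school. The dataset our instructor gave us was fairly small, though still workable. The same problem is also a Kaggle competition: Allstate Claims Severity. My goal here is to learn xgboost, an algorithm that has won first place in many Kaggle competitions, to come away with a clearer understanding of it, and to get better results in future projects.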

For more, see my series The Road To Machine Learning (机器学习之路).

Contents
Overview
Data
Read In Data
IsNULL
Continuous vs Categorical features
Data Processing
Continuous features
Categorical features
Feature engineering
Building models
LinearRegression
LinearSVR
GBDT
XGBOOST

Overview

When you’ve been devastated by a serious car accident, your focus is on the things that matter the most: family, friends, and other loved ones. Pushing paper with your insurance agent is the last place you want your time or mental energy spent. This is why Allstate, a personal insurer in the United States, is continually seeking fresh ideas to improve their claims service for the over 16 million households they protect. Allstate is currently developing automated methods of predicting the cost, and hence severity, of claims. In this recruitment challenge, Kagglers are invited to show off their creativity and flex their technical chops by creating an algorithm which accurately predicts claims severity. Aspiring competitors will demonstrate insight into better ways to predict claims severity for the chance to be part of Allstate’s efforts to ensure a worry-free customer experience.

Put simply, Allstate keeps looking for new models to improve the claims service it provides to households, and the task here is to build a regression model that accurately predicts the severity (cost) of a claim.

Data

Each row in this dataset represents an insurance claim. You must predict the value for the ‘loss’ column. Variables prefaced with ‘cat’ are categorical, while those prefaced with ‘cont’ are continuous.

So loss is the regression target, and the column-name prefix tells you each feature's type: 'cat' for categorical, 'cont' for continuous.

Read In Data

First, import the required packages.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Then read in the data.

train = pd.read_csv('data files/train.csv')
test = pd.read_csv('data files/test.csv')
# Show all rows and columns when printing
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

Next, look at the structure and types of the data.

train.info()
train.head()

The output shows 188,318 rows and 132 columns in total, running from id to loss.

IsNULL

pd.isnull(train).values.any()  # False

Good: there are no missing values, so no missing-value handling is needed.
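The check above only answers yes or no. Had it returned True, a per-column count would show where the gaps are. A minimal sketch on a small made-up frame (the toy columns cat1, cont1, loss and their values are illustrative, not taken from the competition data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values are made up for illustration.
toy = pd.DataFrame({
    'cat1': ['A', 'B', None, 'A'],
    'cont1': [0.1, np.nan, 0.3, np.nan],
    'loss': [100.0, 250.0, 80.0, 120.0],
})

# Single yes/no answer, the same check used in the text.
has_missing = toy.isnull().values.any()

# Missing entries per column, keeping only the columns that have any.
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0]

print(has_missing)
print(missing_per_column)
```

On the real train frame the filtered series would simply be empty.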

Continuous vs Categorical features

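train.info() already gave a rough view of the column types: some columns have the object dtype, which here means categorical, and others are float64, which means continuous. We can select on these dtypes to count the categorical and continuous variables and store their names in two lists.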

# Count the categorical features
cat_features = list(train.select_dtypes(include=['object']).columns)
print('Categorical:{} features'.format(len(cat_features)))
# Categorical:116 features

# Count the continuous features (loss is float64 too, so exclude it)
cont_features = [cont for cont in list(train.select_dtypes(include=['float64']).columns) if cont not in ['loss']]
print('Continuous: {} features'.format(len(cont_features)))
# Continuous: 14 features

So there are 116 categorical features (as their names suggest) and 14 continuous (numeric) features. Adding id and loss gives 132 columns in total.
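The data description says the type is also encoded in the column name, so the dtype-based split can be cross-checked against a prefix-based one. A minimal sketch on a made-up frame (the toy values are illustrative, and select_dtypes(exclude=['number']) is used here in place of include=['object'] so that the sketch does not depend on how a given pandas version stores strings):

```python
import pandas as pd

# Toy frame mimicking the column naming convention; the values are made up.
toy = pd.DataFrame({
    'id': [1, 2, 3],
    'cat1': ['A', 'B', 'A'],
    'cat2': ['B', 'B', 'A'],
    'cont1': [0.72, 0.33, 0.26],
    'cont2': [0.24, 0.73, 0.35],
    'loss': [2213.18, 1283.60, 3005.09],
})

# Split by name prefix, following the data description.
cat_by_name = [c for c in toy.columns if c.startswith('cat')]
cont_by_name = [c for c in toy.columns if c.startswith('cont')]

# Split by dtype; loss is float64 as well, so it is excluded explicitly.
cat_by_dtype = list(toy.select_dtypes(exclude=['number']).columns)
cont_by_dtype = [c for c in toy.select_dtypes(include=['float64']).columns
                 if c != 'loss']

# Both views should agree if the naming convention holds.
print(cat_by_name == cat_by_dtype, cont_by_name == cont_by_dtype)
```

If the two splits ever disagree on real data, that usually points to a numeric-looking categorical column or a continuous column read in as text.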

Next, to get a clearer picture of the categorical data, we count the number of distinct values in each categorical feature.

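# Number of distinct values in each categorical feature
cat_uniques = []
for cat in cat_features:
    cat_uniques.append(len(train[cat].unique()))

# pd.DataFrame.from_items was removed in pandas 1.0; build the frame from a dict instead
uniq_values_in_categories = pd.DataFrame({'cat_name': cat_features, 'unique_values': cat_uniques})
uniq_values_in_categories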
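A quick way to see how cardinality is spread across features, without a plot, is to count how many features share each distinct-value count. A minimal sketch on a made-up frame (the column names and values are illustrative, not from the competition data):

```python
import pandas as pd

# Toy frame with made-up categorical columns.
toy = pd.DataFrame({
    'cat1': ['A', 'B', 'A', 'B'],
    'cat2': ['A', 'A', 'B', 'B'],
    'cat3': ['A', 'B', 'C', 'A'],
    'cat4': ['A', 'B', 'C', 'D'],
})

# Distinct values per feature (nunique ignores NaN by default).
unique_counts = toy.nunique()

# How many features have exactly k distinct values, ordered by k.
features_per_cardinality = unique_counts.value_counts().sort_index()

print(features_per_cardinality)
```

The index of the result is the number of distinct values and the value is the number of features with that cardinality.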


fig, (ax1, ax2) = plt.subplots(1,2)
fig.set_size_inches(16,5)
ax1.hist(uniq_values_in_categories.unique_values, bins=50)
ax1.set_title('Amount of categorical features with X distinct values')
ax1.set_xlabel('Distinct values in a feature')
ax1.set_ylabel('Features')
ax1.annotate('A feature with 326 vals', xy=(322, 2), xytext=(200, 38), arrowprops=dict(facecolor='black'))
ax2.set_xlim(2,30)
ax2.set_title('Zooming in the [0,30] part of left histogram')
ax2.set_xlabel('Distinct values in a feature')
ax2.set_ylabel('Features')
ax2.grid(True)
ax2.hist(uniq_values_in_categories[uniq_values_in_categories.unique_values

